Sample-Optimal Identity Testing with High Probability

Authors

  • Ilias Diakonikolas
  • Themis Gouleakis
  • John Peebles
  • Eric Price
Abstract

We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0 < ε, δ < 1, we wish to distinguish, with probability at least 1 − δ, whether the distributions are identical versus ε-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that δ = Ω(1), for which the sample complexity of identity testing is known to be Θ(√n/ε²). Typical applications of distribution property testing require small values of the confidence parameter δ (which correspond to small “p-values” in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of δ via black-box amplification, which multiplies the required number of samples by Θ(log(1/δ)). We show that this upper bound is suboptimal for any δ = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Θ((1/ε²)(√(n log(1/δ)) + log(1/δ))) for any n, ε, and δ. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_TV(p, U_n) ≥ ε, we simply threshold d_TV(p̂, U_n), where p̂ is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.

∗Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship. †Supported by the NSF under Grant No. 1420692. ‡Supported by the NSF Graduate Research Fellowship under Grant No. 1122374, and by the NSF under Grant No. 1065125.
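
The uniformity tester described in the abstract (threshold the total variation distance between the empirical distribution and U_n) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: all function names are hypothetical, and the threshold is obtained here by Monte Carlo calibration under the uniform distribution, which is an assumption made for illustration rather than the analytic threshold derived in the paper. The two scaling helpers compare the stated optimal bound with black-box amplification, with all constant factors omitted.

```python
import math
import random
from collections import Counter


def empirical_tv_to_uniform(samples, n):
    """Return d_TV(p_hat, U_n) = 1/2 * sum_i |count_i/m - 1/n| over the domain {0, ..., n-1}."""
    m = len(samples)
    counts = Counter(samples)
    # Elements that appear in the sample.
    seen = sum(abs(c / m - 1.0 / n) for c in counts.values())
    # Elements that never appear each contribute |0 - 1/n|.
    unseen = (n - len(counts)) / n
    return 0.5 * (seen + unseen)


def calibrate_threshold(n, m, delta, trials=2000, rng=None):
    """Hypothetical calibration (an assumption, not the paper's method):
    take the empirical (1 - delta) quantile of the statistic under U_n."""
    rng = rng or random.Random(0)
    stats = sorted(
        empirical_tv_to_uniform([rng.randrange(n) for _ in range(m)], n)
        for _ in range(trials)
    )
    idx = min(trials - 1, int((1 - delta) * trials))
    return stats[idx]


def uniformity_test(samples, n, threshold):
    """Accept (True) when the statistic is at most the threshold, reject (False) otherwise."""
    return empirical_tv_to_uniform(samples, n) <= threshold


def optimal_scaling(n, eps, delta):
    """(sqrt(n log(1/delta)) + log(1/delta)) / eps^2, constants omitted."""
    log_term = math.log(1.0 / delta)
    return (math.sqrt(n * log_term) + log_term) / eps**2


def amplified_scaling(n, eps, delta):
    """sqrt(n) / eps^2 times log(1/delta), i.e. black-box amplification, constants omitted."""
    return math.sqrt(n) * math.log(1.0 / delta) / eps**2
```

A caller would first pick the sample size m, compute a threshold with `calibrate_threshold(n, m, delta)`, and then pass fresh samples to `uniformity_test`. Note that a Monte Carlo quantile needs many more than 1/δ trials to be meaningful, so for very small δ an analytic threshold such as the one analyzed in the paper is the practical choice.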

Related resources

Wasserstein Identity Testing

Uniformity testing and the more general identity testing are well studied problems in distributional property testing. Most previous work focuses on testing under L1-distance. However, when the support is very large or even continuous, testing under L1-distance may require a huge (even infinite) number of samples. Motivated by such issues, we consider the identity testing in Wasserstein distanc...

Optimal Testing for Properties of Distributions

Given samples from an unknown distribution p, is it possible to distinguish whether p belongs to some class of distributions C versus p being far from every distribution in C? This fundamental question has received tremendous attention in statistics, focusing primarily on asymptotic analysis, and more recently in information theory and theoretical computer science, where the emphasis has been o...

Differentially Private Testing of Identity and Closeness of Discrete Distributions

We study the fundamental problems of identity testing (goodness of fit), and closeness testing (two sample test) of distributions over k elements, under differential privacy. While the problems have a long history in statistics, finite sample bounds for these problems have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both the problem...

Testing Bayesian Networks

This work initiates a systematic investigation of testing high-dimensional structured distributions by focusing on testing Bayesian networks – the prototypical family of directed graphical models. A Bayesian network is defined by a directed acyclic graph, where we associate a random variable with each node. The value at any particular node is conditionally independent of all the other nondescen...

Fourier-Based Testing for Families of Distributions

We study the general problem of testing whether an unknown discrete distribution belongs to a given family of distributions. More specifically, given a class of distributions 𝒫 and sample access to an unknown distribution P, we want to distinguish (with high probability) between the case that P ∈ 𝒫 and the case that P is ε-far, in total variation distance, from every distribution in 𝒫. This is...

Journal:
  • Electronic Colloquium on Computational Complexity (ECCC)

Volume 24

Publication date: 2017